Mr. Scan: A Hybrid/Hybrid Extreme Scale Density Based Clustering Algorithm

Authors

  • Benjamin Welton
  • Barton P. Miller
Abstract

Density-based clustering algorithms are a widely used class of data mining techniques that can find irregularly shaped clusters and cluster data without prior knowledge of the number of clusters the data contains. DBSCAN is the best-known density-based clustering algorithm. We introduce our extension of DBSCAN, called Mr. Scan, which uses a hybrid/hybrid parallel implementation that combines the MRNet tree-based distribution network with GPU-equipped nodes. Mr. Scan avoids the problems encountered in other parallel versions of DBSCAN, such as scalability limits, reduction in output quality at large scales, and the inability to effectively process dense regions of data. Mr. Scan uses effective data partitioning and a new merging technique to allow data sets to be broken into independently processable partitions without the reduction in quality or large amount of node-to-node communication seen in other parallel versions of DBSCAN. The dense box algorithm designed as part of Mr. Scan allows dense regions to be detected and clustered without the need to individually compare all points in these regions to one another. Mr. Scan was tested on both a geolocated Twitter dataset and image data obtained from the Sloan Digital Sky Survey. In testing Mr. Scan, we performed end-to-end benchmarks measuring complete application run time, from reading raw unordered input point data from the file system to writing the final clustered output to the file system. End-to-end benchmarking gives a clear picture of the performance that can be expected from Mr. Scan in real-world use cases. At its largest scale, Mr. Scan clustered 6.5 billion points from the Twitter dataset on 8,192 GPU nodes on Cray Titan in 7.5 minutes.
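For readers unfamiliar with the baseline that Mr. Scan extends, the following is a minimal, single-threaded sketch of textbook DBSCAN. It is not the authors' implementation: it omits the partitioning, merging, dense box, MRNet, and GPU components described in the abstract, and it uses a brute-force neighborhood query. The names `dbscan`, `eps`, `min_pts`, and `NOISE` are chosen here for illustration only.

```python
from collections import deque
from math import dist

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force eps-neighborhood query; the result includes point i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels
```

Because every neighborhood query scans all points, this sketch does quadratic work in the number of points, which is the kind of all-pairs comparison cost the abstract says the dense box algorithm avoids in dense regions.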

Similar resources

Data Reduction and Partitioning in an Extreme Scale GPU-Based Clustering Algorithm

The emergence of leadership-class systems with GPU-equipped nodes has the potential to vastly increase the performance of existing distributed applications. An increasing number of applications that are converted to run on these systems are reliant on algorithms that perform computations on spatial data. Algorithms that operate on spatial data, such as density-based clustering algorithms, prese...

Proposing a Novel Cost Sensitive Imbalanced Classification Method based on Hybrid of New Fuzzy Cost Assigning Approaches, Fuzzy Clustering and Evolutionary Algorithms

In this paper, a new hybrid methodology is introduced to design a cost-sensitive fuzzy rule-based classification system. A novel cost metric is proposed based on the combination of three different concepts: Entropy, Gini index and DKM criterion. In order to calculate the effective cost of patterns, a hybrid of fuzzy c-means clustering and particle swarm optimization algorithm is utilized. This ...

Tabu-KM: A Hybrid Clustering Algorithm Based on Tabu Search Approach

The clustering problem under the minimum sum-of-squares criterion is a non-convex, non-linear program with many locally optimal values, so a solution often falls into one of these traps and cannot converge to the global optimum. In this paper, an efficient hybrid optimization algorithm, called Tabu-KM, is developed for solving this problem. It gathers the ...

Generating Optimal Timetabling for Lecturers using Hybrid Fuzzy and Clustering Algorithms

UCTTP is an NP-hard problem that must be solved anew each semester. The main technique in the presented approach is to analyze data to resolve uncertainties in lecturers' preferences and constraints within a department, in order to obtain a ranking for each lecturer based on their requirements, with the aim of increasing their satisfaction and develo...

A Hybrid Grey based Two Steps Clustering and Firefly Algorithm for Portfolio Selection

Drawing on the concept of clustering, the main idea of the present study is that the stocks to be chosen and ranked will not necessarily all fall in one cluster. With this in mind, the study aims to offer a new methodology for making decisions about forming a portfolio of stocks in the stock market. To this end, Multiple-Criteria Decisi...


Publication date: 2015